Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix travis - job timeout and test timeout #203

Merged
merged 3 commits into from
Jul 19, 2016
Merged

Fix travis - job timeout and test timeout #203

merged 3 commits into from
Jul 19, 2016

Conversation

zachyee
Copy link
Contributor

@zachyee zachyee commented Jul 7, 2016

Created more jobs to fix the case where a single job would take more than 50 minutes to complete.

Moved creating the docker images for the product tests to before the test executes rather than during the test itself. This helps tests avoid their 10 minute limit for not outputting anything.

Added some logging that can be used in the future to figure out why tests are taking a long time (if this ever happens).

Works successfully with presto version 148, but not with the newest version. I can hardcode the URL where the tests download the presto rpm to fetch the 148 version rather than the newest version. This might be useful to test the release? (This PR fetches 149, not 148, so it will fail on travis)

- PYTHONPATH=$(pwd)
matrix:
- OTHER_TESTS=true
- PRODUCT_TEST_GROUP_SA_BARE="tests/product/test_installer.py tests/product/standalone/test_installation.py"
Copy link
Contributor

@petroav petroav Jul 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does SA stand for?

EDIT: probably stands for standalone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

standalone

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not clear that SA means that, what about YS? Please do not use shortcuts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, maybe you can rename the groups to STANDALONE_BARE_TESTS and YARN_SLIDER_TESTS. Including the PRODUCT_TEST_GROUP_* prefix is a little too verbose and redundant.

@zachyee zachyee force-pushed the fix_travis branch 2 times, most recently from ed95db2 to c11c36d Compare July 7, 2016 23:32
- PRODUCT_TEST_GROUP=11
- PRODUCT_TEST_GROUP=12
- PRODUCT_TEST_GROUP=13
- PRODUCT_TEST_GROUP=14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

travis not only fails because of timeouts. Travis has low amount of resources. There are tests which spin up a docker cluster which consists of 4 docker containers where each tries to start presto. In such case travis kills some processes as the memory usage is exceeding travis capacity. I think this is the most serious issue.

Moreover, when a process is killed, then presto-admin waits until timeout to reports that presto-server didn't show up. Thus making tests last longer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to fix jobs going over 50 minutes and failing. I wasn't considering Travis memory usage as part of this PR.

@kokosing
Copy link
Contributor

kokosing commented Jul 8, 2016

Travis didn't pass ; /

@zachyee
Copy link
Contributor Author

zachyee commented Jul 8, 2016

haha yea i noted when i made the PR that travis still wouldn't work with the newest version of presto (which this branch uses). i created an experiment branch that uses 148 and those builds have worked.

"standalone bare, and yarn-slider presto-admin only images",
action="store_true")

if len(argv) <= 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is best to re-write the argument parsing to use a positional argument. That way you can more succinctly specify that you require at least one argument and you can specify what the available choices are. Here's what I propose:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("image_type", metavar="image_type", type=str, nargs="+",
                        choices=["standalone_presto", "standalone_pa", "standalone_bare",
                                 "yarn_slider_pa", "all"], help="Specify the type of "
                        "image to create. The available choices are: standalone_presto, "
                        "standalone_pa, standalone_bare, yarn_slider_pa, all")

    image_builder = ProductTestImageBuilder()

    args = parser.parse_args()
    if "all" in args.image_type:
        image_builder.test_setup_standalone_pa_only_images()
        image_builder.test_setup_standalone_pa_only_images()
        image_builder.test_setup_standalone_bare_images()
        image_builder.test_setup_yarn_slider_pa_only_images()
    else:
        if "standalone_presto" in args.image_type:
            image_builder.test_setup_standalone_presto_images()
        if "standalone_pa" in args.image_type:
            image_builder.test_setup_standalone_pa_only_images()
        if "standalone_bare" in args.image_type:
            image_builder.test_setup_standalone_bare_images()
        if "yarn_slider_pa" in args.image_type:
            image_builder.test_setup_yarn_slider_pa_only_images()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure this seems a little cleaner

@petroav
Copy link
Contributor

petroav commented Jul 8, 2016

You should squash pretty much all of your commits. At the moment they're broken down at a level that is too granular. I think you should have two commits: one that refactors travis.yml as well as the refactor of BaseProductTest and another commit that introduces the timing decorator.

I would also like @ebd2 to take a look at this when he gets back on Monday.

- PRODUCT_TEST_GROUP_YS_PA="tests/product/yarn_slider/test_slider_installation.py"
#-QUARANTINED="tests/product/yarn_slider/test_server_install.py"
install:
- pip install --upgrade pip==6.1.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You and @kokosing are going to need to coordinate some on this. See #195

elif [ -v OTHER_TESTS ]; then
make clean lint dist docs test-rpm
nosetests --with-timer --timer-ok 60s --timer-warning 300s -s tests.unit
nosetests --with-timer --timer-ok 60s --timer-warning 300s -s tests.integration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is no need to set timers for unit and integration tests, they are really short.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

@kokosing
Copy link
Contributor

20 jobs travis sounds like an overkill, note that you are running tox -e p26 on machines which requested python 2.7

from time import time


def _initialize_module_logger(module_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I suggest we remove this method completely and make this piece of code execute at the module level. In the functions where you refer to the logger variable make sure you include the line global logger. Also, I suggest we hardcode the module to "root". I think that's fine because the reason why you would want to have different loggers for every module is to be able to adjust the logging level and logging destination at a more granular level whereas we just print everything at INFO to the console.

I did some experiments and determined that if the code is moved to be module level it will only get executed once.

If a product test case is taking a long time, this decorator
can be added to the test and some of its helper function to
see where and why the function is taking a long time.
@zachyee
Copy link
Contributor Author

zachyee commented Jul 15, 2016

@kokosing good point. I think the tests are grouped as optimally as possible, given how long they take right now. I think the only solution is to rework the tests to run faster (like you said). It also doesn't help that the tests have to be run with Python 2.6, 2.7, which automatically doubles the number of jobs per build. There's no way to work around this is there?

@petroav is there a ticket filed somewhere for trying to make the tests fun quicker so we can have less jobs per build or should I do that? I also fixed up the logging in timing_test_decorator.py. You can double-check to make sure everything looks good there.

@ebd2 i separated image_builder.py to just build images (_setup_image()) and gave the BaseProductTestCase a setup_cluster() function to use. Can you double-check that all the logic is correct for both of those functions?

@petroav
Copy link
Contributor

petroav commented Jul 17, 2016

@zachyee nah, there's no ticket for speeding up the tests. Feel free to create one but don't work on it. If it takes 20 jobs to bring each run to under 50 minutes, or whatever the timeout is, then so be it. Have you tried with fewer jobs? Things might run faster now with the bigger machines so you should definitely see if you can optimize it.



class ImageBuilder:
def __init__(self, caller):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name caller suggests that it can be something pretty generic. I think we should do the following:

  1. Let this take any testcase (and rename caller -> testcase)
  2. Extract the cluster types into tests/product/cluster_types.py and import them here and in base_product_case.

That separates the two dependencies (testcase for assertions and cluster definitions) such that they can be satisfied independently.

@ebd2
Copy link
Contributor

ebd2 commented Jul 18, 2016

Looks good modulo the small refactor and the naming nit.

@kokosing
Copy link
Contributor

@zachyee Please make sure that the problem from prestodb/presto#5689 does not exists in your travis.yml script.

Test grouping is ok. Thanks.

Using tox and travis to provision different python version is redundant. We should choose one of these, I prefer travis. See #201. You are using tox -e py26... so even two version of python are requested only py26 is tested.

matrix:
- OTHER_TESTS=true
- SHORT_PRODUCT_TEST_GROUP=0
- LONG_PRODUCT_TEST_GROUP_STANDALONE_PRESTO_ADMIN="tests/product/test_server_install.py"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd get rid of the word standalone in all the test group names. It's the only way to run presto-admin right now, and makes the name really long.
(Sorry for chiming in so late, I've been using this patch to be able to run tests on travis, and noticed this. Also--works great! thanks!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@zachyee
Copy link
Contributor Author

zachyee commented Jul 19, 2016

I combined a couple of job groups to get rid of 4 jobs (from 18 to 14). That's the best I can do until tests are faster or someone reaches out to Travis to increase the 50-minute timeout limit.

Did the cluster_type refactor and renaming.

@kokosing thanks for pointing that out. I ran into a similar problem when it wasn't erroring out from lint errors. I went ahead and && commands together so it fails when the first command fails. It works fine for presto-admin because I made it one long if, elif, ... else script, whereas presto needs a bunch of separate scripts with single if statements. If this changes, we can use the fix introduced to presto, but I think this way is more readable. I also removed using tox and invoked nosetests directly. It will use whatever version of python is installed on the travis machine, so we can test 2.6, 2.7.

except:
# retry once
rpm_name = StandalonePrestoInstaller._download_rpm()
raise OSError(1, 'Presto RPM not detected.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you have a testcase where this gets called in the constructor. Does it make sense to pass it in here and testcase.fail() if the RPM doesn't exist?

It might not because it isn't a real test case, if it's messier that way, I'm fine with it as-is.

@ebd2
Copy link
Contributor

ebd2 commented Jul 19, 2016

LGTM!

@rschlussel-zz
Copy link
Member

I wouldn't worry about decreasing the number of jobs further. It's better for it not to need so many jobs, but having the whole thing complete in a reasonable amount of time is more important.

@petroav
Copy link
Contributor

petroav commented Jul 19, 2016

Looks good

@zachyee zachyee merged commit 89c6870 into master Jul 19, 2016
@zachyee zachyee deleted the fix_travis branch July 19, 2016 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants